Abstract

Here we propose a novel model family with 
the objective of learning to disentangle the 
factors of variation in data. Our approach is 
based on the spike-and-slab restricted Boltzmann
machine which we generalize to include
higher-order interactions among multiple latent
variables. Seen from a generative perspective,
the multiplicative interactions emulate
the entangling of factors of variation.
Inference in the model can be seen as dis- 
entangling these generative factors. Unlike 
previous attempts at disentangling latent fac- 
tors, the proposed model is trained using no 
supervised information regarding the latent 
factors. We apply our model to the task of 
facial expression classification. 



1. Introduction 

In many machine learning tasks, data originates from 
a generative process involving complex interaction of 
multiple factors. Alone each factor accounts for a 
source of variability in the data. Together their inter- 
action gives rise to the rich structure characteristic of 
many of the most challenging domains of application. 
Consider, for example, the task of facial expression 
recognition. Two images of different individuals with
the same facial expression may be well separated
in pixel space. On the other hand,
two images of the same individual showing different
expressions may well be positioned very close together
in pixel space. In this simplified scenario, there are
two factors at play: (1) the identity of the individual, 
and (2) the facial expression. One of these factors, 
the identity, is irrelevant to the task of facial expres- 
sion recognition and yet of the two factors it could 
well dominate the representation of the image in pixel 
space. As a result, pixel space-based facial expression
recognition systems seem likely to suffer poor performance
due to the variation in appearance of individual faces.

Importantly, these interacting factors frequently do 
not combine as simple superpositions that can be easily
separated by choosing an appropriate affine projection
of the data. Rather, these factors often appear
tightly entangled in the raw data. Our challenge is to 
construct representations of the data that cope with 
the reality of entangled factors of variation and pro- 
vide features that may be appropriate to a wide variety 
of possible tasks. In the context of our face data example,
a representation capable of disentangling identity
and expression would be an effective representation for
either face recognition or facial expression classification.

In an effort to cope with these factors of variation, 
there has been a broad-based movement in machine 
learning and in application domains such as computer 
vision toward hand-engineering feature sets that are 
invariant to common sources of variation in data. This 
is the motivation behind both the inclusion of feature
pooling stages in the convolutional network architecture
(LeCun et al., 1989) and the recent trend
toward representations based on large scale pooling of
low-level features (Wang et al., 2009; Coates et al.,
2011). These approaches all stem from the powerful
idea that invariant features of the data can be induced 
through the pooling together of a set of simple filter
responses. Potentially even more powerful is the notion
that one can actually learn which filters to pool
together from purely unsupervised data, and thereby
extract directions of variance over which the pooling
features become invariant (Kohonen et al., 1979;
Kohonen, 1996; Hyvärinen and Hoyer, 2000; Le et al.,
2010; Kavukcuoglu et al., 2009; Ranzato and Hinton,
2010; Courville et al., 2011b). However, in situations
where there are multiple relevant but entangled factors
of variation that give rise to the data, we require
a means of feature extraction that disentangles these
factors in the data rather than simply learning to represent
some of these factors at the expense of those that
are lost in the filter pooling operation.

Here we propose a novel model family with the objective
of learning to disentangle the factors of variation
evident in the data. Our approach is based on
the spike-and-slab restricted Boltzmann machine
(ssRBM) (Courville et al., 2011a), which has recently been
shown to be a promising model of natural image data.
We generalize the ssRBM to include higher-order interactions
among multiple binary latent variables. Seen
from a generative perspective, the multiplicative interactions
of the binary latent variables emulate the
entangling of the factors that give rise to the data.
Conversely, inference in the model can be seen as an
attempt to assign credit to the various interacting factors
for their combined account of the data: in effect,
to disentangle the generative factors. Our approach relies
only on unsupervised approximate maximum likelihood
learning of the model parameters, and as such
we do not require the use of any label information in 
defining the factors to be disentangled. We believe 
this to be a research direction of critical importance, 
as it is almost never the case that label information 
exists for all factors responsible for variations in the 
data distribution. 



2. Learning Invariant Features Versus 
Learning to Disentangle Features 

The principle that invariant features can actually 
emerge, using only unsupervised learning, from the organization
of features into subspaces was first established
in the ASSOM model (Kohonen, 1996). Since
then, the same basic strategy has reappeared in a
number of different models and learning paradigms,
including topological independent component analysis
(Hyvärinen and Hoyer, 2000; Le et al., 2010),
invariant predictive sparse decomposition (IPSD)
(Kavukcuoglu et al., 2009), as well as in Boltzmann
machine-based approaches (Ranzato and Hinton, 2010;
Courville et al., 2011b). In each case, the basic strategy
is to group filters together by, for example, using a
variable (the pooling feature) that gates the activation 
for all elements of the group. This gated activation 
mechanism causes the filters within the group to share 
a common window on the dataset, which in turn leads 
to filter groups composed of mutually complementary 
filters. In the end, the span of the filter vectors defines 
a subspace which specifies the directions in which the 
pooling feature is invariant. Somewhat surprisingly, 
this basic strategy has repeatedly demonstrated that 
useful invariant features can be learned in a strictly 
unsupervised fashion, using only the statistical structure
inherent in the data. While remarkable, one important
problem with using this learning strategy is
that the invariant representation formed by the pooling
features offers a somewhat incomplete view of the
data as the detailed representation of the lower-level
features is abstracted away in the pooling procedure. 
While we would like higher level features to be more 
abstract and exhibit greater invariance, we have little 
control over what information is lost through feature 
subspace pooling. 

Invariant features, by definition, have reduced sensi- 
tivity in the direction of invariance. This is the goal 
of building invariant features and fully desirable if the 
directions of invariance all reflect sources of variance 
in the data that are uninformative to the task at hand. 
However, it is often the case that the goal of feature 
extraction is the disentangling or separation of many 
distinct but informative factors in the data. In this sit- 
uation, the methods of generating invariant features - 
namely, the feature subspace method - may be inade- 
quate. Returning to our facial expression classification 
example from the introduction, consider a pooling fea- 
ture made invariant to the expression of a subject by 
forming a subspace of low-level filters that represent 
the subject with various facial expressions (forming a 
basis for the subspace). If this is the only pooling 
feature that is associated with the appearance of this 
subject, then the facial expression information is lost 
to the model representation formed by the set of pool- 
ing features. As illustrated in our hypothetical facial 
expression classification task, this loss of information 
becomes a problem when the information that is lost 
is necessary to successfully complete the task at hand. 

Obviously, what we really would like is for a particu- 
lar feature set to be invariant to the irrelevant features 
and disentangle the relevant features. Unfortunately, 
it is often difficult to determine a priori which set 
of features will ultimately be relevant to the task at 
hand. Further, as is often the case in the context of
deep learning methods (Collobert and Weston, 2008),
the feature set being trained may be destined to be
used in multiple tasks that may have distinct subsets 
of relevant features. Considerations such as these lead 
us to the conclusion that the most robust approach 
to feature learning is to disentangle as many factors 
as possible, discarding as little information about the 
data as is practical. This is the motivation behind our 
proposed higher-order spike-and-slab Boltzmann ma- 
chine. 
Figure 1. Energy function of our higher-order spike & slab RBM (ssRBM), used to disentangle (multiplicative) factors of
variation in the data. Two groups of latent spike variables, g and h, interact to explain the data v, through the weight
tensor W. While the ssRBM instantiates a slab variable $s_j$ for each hidden unit $h_j$, our higher-order model employs a
slab $s_{ij}$ for each pair of spike variables $(g_i, h_j)$. $\mu_{ij}$ and $\alpha_{ij}$ are respectively the mean and precision parameters of $s_{ij}$. An
additional set of spike variables f are used to gate groups of latent variables h, g and serve to promote group sparsity.
Most parameters are thus indexed by an extra subscript k. Finally, e, c and d are standard bias terms for variables f, g
and h, while $\Lambda$ is a diagonal precision matrix on the visible vector.



3. Higher-order Spike-and-Slab 
Boltzmann Machines 

In this section, we introduce a model which makes 
some progress toward the ambitious goal of disentan- 
gling factors of variation. The model is based on the 
Boltzmann machine, an undirected graphical model. 
In particular we build on the spike-and-slab restricted
Boltzmann machine (ssRBM) (Courville et al., 2011b),
a model family that has previously shown promise as
a means of learning invariant features via subspace
pooling. The original ssRBM model possessed a limited
form of higher-order interaction of two latent random
variables: the spike and the slab. Our extension
adds higher-order interactions between four distinct
latent random variables. These include one set of
slab variables and three interacting binary spike variables.
Unlike in the ssRBM, the interactions between
the latent variables violate the conditional independence
constraint of the restricted Boltzmann machine,
and the model therefore does not belong to this class of models.
As a consequence, exact inference in the model is not 
tractable and we resort to a mean-field approximation. 

Our strategy in promoting this model is that we in- 
tend to disentangle factors of variation via inference 
(recovering the posterior distribution over our latent 
variables) in a generative model. In the context of 
generative models, inference can roughly be thought 
of as running the generative process in reverse. Thus 
if we wish our inference process to disentangle factors 
of variation, our generative process should describe a 
means of factor entangling. The generative model we 
propose here represents one possible means of factor 
entangling. 

Let $v \in \mathbb{R}^{D}$ be the random visible vector that represents
our observations with its mean zeroed. We
build a latent representation of this data with binary
latent variables $f \in \{0,1\}^{K}$, $g \in \{0,1\}^{M \times K}$ and
$h \in \{0,1\}^{N \times K}$. In the spike-and-slab context, we can
think of f, g and h as a factored representation of the
"spike" variables. We also include a set of real-valued
"slab" variables $s \in \mathbb{R}^{M \times N \times K}$, with element $s_{ijk}$
associated with hidden units $f_k$, $g_{ik}$ and $h_{jk}$. The interaction
between these variables is defined through
the energy function of Fig. 1.

The parameters are defined as follows.
$W \in \mathbb{R}^{D \times M \times N \times K}$ is a weight 4-tensor connecting visible
units to the interacting latent variables; these can
be interpreted as forming a basis in image space;
$\mu \in \mathbb{R}^{M \times N \times K}$ and $\alpha \in \mathbb{R}^{M \times N \times K}$ are tensors describing
the mean and precision of each $s_{ijk}$; $\Lambda \in \mathbb{R}^{D \times D}$ is
a diagonal precision matrix on the visible vector; and
finally $c \in \mathbb{R}^{M \times K}$, $d \in \mathbb{R}^{N \times K}$ and $e \in \mathbb{R}^{K}$ are biases
on the matrices g, h and vector f respectively.
The energy function fully specifies the joint probability
distribution over the variables v, s, f, g and
h: $p(v,s,f,g,h) = \frac{1}{Z} \exp\{-E(v,s,f,g,h)\}$, where Z
is the partition function which ensures that the joint
distribution is normalized.

As specified above, the energy function is similar to the 
ssRBM energy function (Courville et al., 2011a,b), but
includes a factored representation of the standard ssRBM
spike variable. Yet, clearly the properties of the
model are highly dependent on the topology of the interactions
between the real-valued slab variables $s_{ijk}$
and the three binary spike variables $f_k$, $g_{ik}$ and $h_{jk}$. We
adopt a strategy that permits local interactions within
small groups of f, g and h in a block-like organizational
pattern as specified in Fig. 2. The local block structure
allows the model to work incrementally towards
disentangling the features by focusing on manageable 
subparts of the problem. 

Figure 2. Block-sparse connectivity pattern with dense interactions
between g and h within each block (only shown
for the k-th block). Each block is gated by a separate $f_k$ variable.

Similar to the standard spike-and-slab restricted
Boltzmann machine (Courville et al., 2011a,b), the energy
function in Eq. 1 gives rise to a Gaussian conditional
over the visible variables:

$$p(v \mid s,f,g,h) = \mathcal{N}\Big( \Lambda^{-1} \sum_{i,j,k} W_{\cdot ijk}\, s_{ijk}\, g_{ik} h_{jk} f_k ,\; \Lambda^{-1} \Big)$$



Here we have a four-way multiplicative interaction in 
the latent variables s, f, g and h. The real-valued 
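slab variable $s_{ijk}$ acts to scale the contribution of the
weight vector $W_{\cdot ijk}$. As a consequence, after marginalizing
out s, the factors f, g and h can also be seen as
contributing both to the conditional mean and conditional
variance of $p(v \mid f,g,h)$:

$$p(v \mid f,g,h) = \mathcal{N}\Big( C_{v \mid f,g,h} \sum_{i,j,k} W_{\cdot ijk}\, \mu_{ijk}\, g_{ik} h_{jk} f_k ,\; C_{v \mid f,g,h} \Big)$$

$$C_{v \mid f,g,h} = \Big( \Lambda - \sum_{i,j,k} W_{\cdot ijk} W_{\cdot ijk}^{T}\, \alpha_{ijk}^{-1}\, g_{ik} h_{jk} f_k \Big)^{-1}$$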

This is an important property of the spike-and-slab 
framework that is also shared by other latent vari- 
able models of real-valued data such as the mean- 
covariance restricted Boltzmann machine (mcRBM)
(Ranzato and Hinton, 2010) and the mean Product of
T-distributions model (mPoT) (Ranzato et al., 2010).
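The two Gaussian conditionals over the visible vector can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: it assumes $\Lambda$ is stored as a vector `lam` holding its diagonal, that arrays follow the shapes given in the text (W: D x M x N x K, and so on), and the function names are hypothetical.

```python
import numpy as np

def gate(f, g, h):
    """gate[i, j, k] = g_ik * h_jk * f_k for f: (K,), g: (M, K), h: (N, K)."""
    return g[:, None, :] * h[None, :, :] * f[None, None, :]

def visible_given_all(W, lam, s, f, g, h):
    """Mean and diagonal covariance of p(v | s, f, g, h); lam = diag(Lambda)."""
    mean = np.einsum("dijk,ijk->d", W, s * gate(f, g, h)) / lam
    return mean, 1.0 / lam

def visible_given_spikes(W, lam, mu, alpha, f, g, h):
    """Mean and full covariance of p(v | f, g, h), with the slabs integrated out."""
    q = gate(f, g, h)
    # Precision: Lambda - sum_ijk alpha_ijk^{-1} W_ijk W_ijk^T g_ik h_jk f_k
    precision = np.diag(lam) - np.einsum("dijk,eijk,ijk->de", W, W, q / alpha)
    cov = np.linalg.inv(precision)
    mean = cov @ np.einsum("dijk,ijk->d", W, mu * q)
    return mean, cov

# Small illustrative sizes (assumed, not from the paper).
rng = np.random.default_rng(0)
D, M, N, K = 6, 2, 3, 2
W = 0.1 * rng.normal(size=(D, M, N, K))
lam = np.full(D, 5.0)
mu = np.ones((M, N, K))
alpha = np.ones((M, N, K))
s = rng.normal(size=(M, N, K))
g, h = np.ones((M, K)), np.ones((N, K))

mean_s, var_s = visible_given_all(W, lam, s, np.array([1.0, 0.0]), g, h)
mean_on, cov_on = visible_given_spikes(W, lam, mu, alpha, np.ones(K), g, h)
mean_off, cov_off = visible_given_spikes(W, lam, mu, alpha, np.zeros(K), g, h)
```

With all blocks switched off (f = 0) the covariance reduces to $\Lambda^{-1}$; switching blocks on subtracts from the precision, so the active filters add variance along their own directions. The example keeps W small so that the precision matrix stays positive definite.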



From a generative perspective, the model can be
thought of as consisting of a set of K factor blocks
whose activity is gated by the f variables. Within
each block, the variables $g_{\cdot k}$ and $h_{\cdot k}$ can be thought
of as local latent factors whose interaction gives rise
to the active block's contribution to the visible vector.
Crucially, the multiplicative interaction between
$g_{\cdot k}$ and $h_{\cdot k}$ for a given block k is mediated by the
weight tensor $W_{\cdot,\cdot,\cdot,k}$ and the corresponding slab variables
$s_{\cdot,\cdot,k}$. Contrary to more standard probabilistic
factor models whose factors simply sum to give rise
to the visible vector, the individual contributions of
the elements of $g_{\cdot k}$ and $h_{\cdot k}$ are not easily isolated from
one another. We can think of the generative process
as entangling the local block factor activations.

From an encoding perspective, we are interested in us- 
ing the posterior distribution over the latent variables 
as a representation or encoding of the data. Unlike 
in RBMs, in the case of the proposed model where 
we have higher-order interactions over the latent vari- 
ables, the posterior over the latent variables does not 
factorize cleanly. By marginalizing over the slab vari- 
ables s, we can recover a set of conditionals describing 
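how the binary latent variables f, g and h interact.
The conditional $P(f \mid v,g,h)$ is given below.

$$P(f_k = 1 \mid v,g,h) = \mathrm{sigm}\Big( e_k + \sum_{i,j} v^{T} W_{\cdot ijk}\, \mu_{ijk}\, g_{ik} h_{jk} + \frac{1}{2} \sum_{i,j} \alpha_{ijk}^{-1} \big[ v^{T} W_{\cdot ijk} \big]^{2} g_{ik} h_{jk} \Big)$$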



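It illustrates that with the factor configuration given
in Fig. 2, the factors $f_k$ are activated (assume value 1)
through the sum-pooled response of all the weight vectors
$W_{\cdot ijk}$ ($\forall\, 1 \le i \le M$ and $1 \le j \le N$) differentially
gated by the values of $g_{ik}$ and $h_{jk}$, whose conditionals
are respectively given by:

$$P(g_{ik} = 1 \mid v,f,h) = \mathrm{sigm}\Big( c_{ik} + \sum_{j=1}^{N} v^{T} W_{\cdot ijk}\, \mu_{ijk}\, h_{jk} f_k + \frac{1}{2} \sum_{j=1}^{N} \alpha_{ijk}^{-1} \big[ v^{T} W_{\cdot ijk} \big]^{2} h_{jk} f_k \Big)$$

$$P(h_{jk} = 1 \mid v,f,g) = \mathrm{sigm}\Big( d_{jk} + \sum_{i=1}^{M} v^{T} W_{\cdot ijk}\, \mu_{ijk}\, g_{ik} f_k + \frac{1}{2} \sum_{i=1}^{M} \alpha_{ijk}^{-1} \big[ v^{T} W_{\cdot ijk} \big]^{2} g_{ik} f_k \Big)$$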



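For completeness, we also include the Gaussian conditional
distribution over the slab variables s:

$$p(s_{ijk} \mid v,f,g,h) = \mathcal{N}\Big( \big( \alpha_{ijk}^{-1} v^{T} W_{\cdot ijk} + \mu_{ijk} \big) f_k g_{ik} h_{jk} ,\; \alpha_{ijk}^{-1} \Big)$$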
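As a rough illustration of how these conditionals could be evaluated, the NumPy sketch below computes the three spike conditionals and draws the slabs. It assumes each spike's logit is its bias (e for f, c for g, d for h) plus the gated sum of $v^{T} W_{\cdot ijk} \mu_{ijk} + \frac{1}{2}\alpha_{ijk}^{-1}[v^{T} W_{\cdot ijk}]^{2}$, which is the form implied by the covariance of $p(v \mid f,g,h)$; the function names and array layout are hypothetical.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def spike_conditionals(v, W, mu, alpha, c, d, e, f, g, h):
    """Return P(f_k=1 | v,g,h), P(g_ik=1 | v,f,h), P(h_jk=1 | v,f,g)."""
    a = np.einsum("d,dijk->ijk", v, W)       # a_ijk = v^T W_{.ijk}
    t = a * mu + 0.5 * a ** 2 / alpha        # evidence contributed by filter (i, j, k)
    p_f = sigm(e + np.einsum("ijk,ik,jk->k", t, g, h))
    p_g = sigm(c + np.einsum("ijk,jk->ik", t, h) * f)
    p_h = sigm(d + np.einsum("ijk,ik->jk", t, g) * f)
    return p_f, p_g, p_h

def sample_slabs(rng, v, W, mu, alpha, f, g, h):
    """Draw s_ijk ~ N((a_ijk / alpha_ijk + mu_ijk) f_k g_ik h_jk, 1 / alpha_ijk)."""
    a = np.einsum("d,dijk->ijk", v, W)
    q = g[:, None, :] * h[None, :, :] * f    # gate, shape (M, N, K)
    mean = (a / alpha + mu) * q
    return mean + rng.standard_normal(mean.shape) / np.sqrt(alpha)

# Small illustrative sizes (assumed, not from the paper).
rng = np.random.default_rng(0)
D, M, N, K = 6, 2, 3, 2
W = 0.1 * rng.normal(size=(D, M, N, K))
mu, alpha = np.ones((M, N, K)), np.ones((M, N, K))
c, d, e = np.zeros((M, K)), np.zeros((N, K)), np.zeros(K)
f, g, h = np.ones(K), np.ones((M, K)), np.ones((N, K))
v = rng.normal(size=D)

p_f, p_g, p_h = spike_conditionals(v, W, mu, alpha, c, d, e, f, g, h)
s = sample_slabs(rng, v, W, mu, alpha, f, g, h)
```

Note how f multiplies the pooled evidence for g and h: when a block's f is off, the g and h units in that block fall back to their bias-only probabilities.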



From an encoding perspective, the gating pattern on
the g and h variables, evident from Fig. 2 and from
the conditional distributions, defines a form of local
bilinear interaction (Tenenbaum and Freeman, 2000).
We can interpret the values of $g_{ik}$ and $h_{jk}$ within block
k as acting as basis indicators, in dimensions i and j,
for the linear subspace in the visible space defined by
$W_{\cdot ijk} s_{ijk}$.

From this perspective, we can think of $[g_{\cdot k}, h_{\cdot k}]$ as
defining a block-local binary coordinate encoding of
the data. Consider the case illustrated by Fig. 2 where
we have M = 5, N = 5 and the number of blocks (K)
is 4. For each block, we have M × N = 25 filters
which we encode using M + N = 10 binary latent
variables, where each $g_{ik}$ (alternately $h_{jk}$) effectively
pools over the subspace characterized by the variables
$h_{jk}$, $1 \le j \le N$ (alternately $g_{ik}$, $1 \le i \le M$) through
their relative interaction with $W_{\cdot ijk} s_{ijk}$. As a concrete
example, imagine that the structure of the weight tensor
was such that, along the dimension indexed by i,
the weight vectors $W_{\cdot ijk}$ form oriented Gabor-like edge
detectors of different orientations. Yet along the dimension
indexed by j, the weight vectors $W_{\cdot ijk}$ form
oriented Gabor-like edge detectors of different colors.
In this hypothetical example, $g_{ik}$ encodes orientation
information while being invariant to the color of the
edge, while $h_{jk}$ encodes color information while being
invariant to orientation. Hence we could say that we
have disentangled the latent factors.
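The counting argument above can be checked with a few lines of Python. This is only an illustration of the M = 5, N = 5, K = 4 case from the text; the particular spike values `g_k` and `h_k` are made up for the example.

```python
import numpy as np

M, N, K = 5, 5, 4

filters_per_block = M * N                   # one filter W_{.ijk} per (i, j) pair
spikes_per_block = M + N                    # g_{1k..Mk} and h_{1k..Nk}
total_filters = K * filters_per_block
total_spikes = K * (spikes_per_block + 1)   # +1 for the gating unit f_k

# Within one block, filter (i, j) is switched on only if g_ik = h_jk = 1.
g_k = np.array([1, 0, 0, 1, 0])             # hypothetical row indicators
h_k = np.array([0, 1, 0, 0, 0])             # hypothetical column indicators
active = np.outer(g_k, h_k)                 # (M, N) mask of active filters

print(filters_per_block, spikes_per_block, total_filters, total_spikes)  # 25 10 100 44
print(int(active.sum()))                                                 # 2
```

The outer product makes the bilinear coordinate encoding explicit: two active rows and one active column select exactly two of the 25 filters in the block.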

3.1. Higher-order Interactions as a Multi-Way 
Pooling Strategy 

As alluded to above, one interpretation of the role of 
g and h is as distinct and complementary sum-pooled 
feature sets. Returning to Fig. 2, we can see that,
for each block, the $g_{ik}$ pool across the columns of
the k-th block, along the i-th row, while the $h_{jk}$ pool
across rows, along the j-th column. The f variables
are also interpretable as pooling across all elements of
the block. One way to interpret the complementary
pooling structures of g and h is as a multi-way
pooling strategy.

This particular pooling structure was chosen to study 
the potential of learning the kind of bilinear interaction 
that exists between $g_{\cdot k}$ and $h_{\cdot k}$ within a block. The
$f_k$ are present to promote block cohesion by gating
the interaction between $g_{\cdot k}$ and $h_{\cdot k}$ and the visible
vector v.

This higher-order structure is of course just one choice 
of many possible higher-order interaction architec- 
tures. One can easily imagine defining arbitrary over- 
lapping pooling regions, with the number of overlap- 
ping pooling regions specifying the order of the latent 
variable interaction. We believe that exploration of
overlapping pooling regions of this type is a promising
direction of future inquiry. One potentially interesting
direction is to consider overlapping blocks (such as our
f blocks). The overlap will define a topology over the
features as they will share lower-level features (i.e. the
slab variables). A topology thus defined could potentially
be exploited to build higher-level data representations
that possess local receptive fields. This kind
of local receptive field has been shown to be useful
in building large and deep models that perform well
in object classification tasks in natural images (Coates
et al., 2011).



3.2. Variational inference and unsupervised 
learning 

Due to the multiplicative interaction between the latent
variables f, g and h, computation of $P(f \mid v)$,
$P(g \mid v)$ and $P(h \mid v)$ is intractable. While the slab variables
also interact multiplicatively, we are able to analytically
marginalize over them. Consequently we resort
to a variational approximation of the joint conditional
$P(f,g,h \mid v)$ with the standard mean-field structure,
i.e. we choose $Q_v(f,g,h) = Q_v(f) Q_v(g) Q_v(h)$ such
that the KL divergence $KL(Q_v(f,g,h) \,\|\, P(f,g,h \mid v))$ is
minimized, or equivalently, that the variational lower
bound $\mathcal{L}(Q_v)$ on the log likelihood of the data is maximized:

$$\max_{Q_v} \mathcal{L}(Q_v) = \max_{Q_v} \sum_{f,g,h} Q_v(f) Q_v(g) Q_v(h) \log \frac{P(f,g,h \mid v)}{Q_v(f) Q_v(g) Q_v(h)}$$

where the sums are taken over all values of the elements
of f, g and h respectively. Maximizing this
lower bound with respect to the variational parameters
$\hat{f}_k \equiv Q_v(f_k = 1)$, $\hat{g}_{ik} \equiv Q_v(g_{ik} = 1)$ and
$\hat{h}_{jk} \equiv Q_v(h_{jk} = 1)$ results in the set of approximating
factored distributions:



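$$\hat{f}_k = \mathrm{sigm}\Big( e_k + \sum_{i,j} v^{T} W_{\cdot ijk}\, \mu_{ijk}\, \hat{g}_{ik} \hat{h}_{jk} + \frac{1}{2} \sum_{i,j} \alpha_{ijk}^{-1} \big[ v^{T} W_{\cdot ijk} \big]^{2} \hat{g}_{ik} \hat{h}_{jk} \Big)$$

$$\hat{g}_{ik} = \mathrm{sigm}\Big( c_{ik} + \sum_{j=1}^{N} v^{T} W_{\cdot ijk}\, \mu_{ijk}\, \hat{h}_{jk} \hat{f}_k + \frac{1}{2} \sum_{j=1}^{N} \alpha_{ijk}^{-1} \big[ v^{T} W_{\cdot ijk} \big]^{2} \hat{h}_{jk} \hat{f}_k \Big)$$

$$\hat{h}_{jk} = \mathrm{sigm}\Big( d_{jk} + \sum_{i=1}^{M} v^{T} W_{\cdot ijk}\, \mu_{ijk}\, \hat{g}_{ik} \hat{f}_k + \frac{1}{2} \sum_{i=1}^{M} \alpha_{ijk}^{-1} \big[ v^{T} W_{\cdot ijk} \big]^{2} \hat{g}_{ik} \hat{f}_k \Big)$$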



The above equations form a set of fixed point equations
which we iterate until the values of all $Q_v(f_k)$, $Q_v(g_{ik})$
and $Q_v(h_{jk})$ converge. Since the update for $f_k$ does
not depend on $f_{k'}$, $\forall k'$, the update for $g_{ik}$ does not depend on $g_{i'k'}$,
$\forall i', k'$, and the update for $h_{jk}$ does not depend on $h_{j'k'}$, $\forall j', k'$, we can
define a three-stage update strategy where we update
all K values of f in parallel, then update
all K × M values of g in parallel and finally update all
K × N values of h in parallel.
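The three-stage schedule can be sketched as a short NumPy loop. This is a minimal illustration under assumptions of mine, not the authors' code: the logits are taken to be bias plus the gated sum of $v^{T} W_{\cdot ijk} \mu_{ijk} + \frac{1}{2}\alpha_{ijk}^{-1}[v^{T} W_{\cdot ijk}]^{2}$, the variational parameters are initialized at 0.5, and the stopping tolerance and iteration cap are arbitrary.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, mu, alpha, c, d, e, n_iter=100, tol=1e-6):
    """Iterate the f -> g -> h fixed-point updates until they stop changing."""
    _, M, N, K = W.shape
    a = np.einsum("d,dijk->ijk", v, W)        # a_ijk = v^T W_{.ijk}
    t = a * mu + 0.5 * a ** 2 / alpha         # per-filter evidence, fixed given v
    f_hat = np.full(K, 0.5)
    g_hat = np.full((M, K), 0.5)
    h_hat = np.full((N, K), 0.5)
    for _ in range(n_iter):
        # Stage 1: all K values of f in parallel.
        f_new = sigm(e + np.einsum("ijk,ik,jk->k", t, g_hat, h_hat))
        # Stage 2: all M x K values of g in parallel, using the fresh f.
        g_new = sigm(c + np.einsum("ijk,jk->ik", t, h_hat) * f_new)
        # Stage 3: all N x K values of h in parallel, using the fresh f and g.
        h_new = sigm(d + np.einsum("ijk,ik->jk", t, g_new) * f_new)
        delta = max(np.abs(f_new - f_hat).max(),
                    np.abs(g_new - g_hat).max(),
                    np.abs(h_new - h_hat).max())
        f_hat, g_hat, h_hat = f_new, g_new, h_new
        if delta < tol:
            break
    return f_hat, g_hat, h_hat

# Small illustrative sizes (assumed, not from the paper).
rng = np.random.default_rng(0)
D, M, N, K = 6, 2, 3, 2
W = 0.1 * rng.normal(size=(D, M, N, K))
mu, alpha = np.ones((M, N, K)), np.ones((M, N, K))
c, d, e = np.zeros((M, K)), np.zeros((N, K)), np.zeros(K)
v = rng.normal(size=D)

f_hat, g_hat, h_hat = mean_field(v, W, mu, alpha, c, d, e)
```

Because the evidence tensor `t` depends only on v, it is computed once; each stage then reduces to a single tensor contraction followed by a sigmoid.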



Following the variational EM training approach (Saul
et al., 1996), we alternately maximize the lower bound
$\mathcal{L}(Q_v)$ with respect to the variational parameters $\hat{f}$,
$\hat{g}$ and $\hat{h}$ (E-step) and with respect
to the model parameters (M-step). The gradient
of $\mathcal{L}(Q_v)$ with respect to the model parameters
$\theta = \{W, \mu, \alpha, \Lambda, b, c, d, e\}$ is given by:



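$$\frac{\partial \mathcal{L}(Q_v)}{\partial \theta} = - \sum_{f,g,h} Q_v(f) Q_v(g) Q_v(h)\, \mathbb{E}_{p(s \mid v,f,g,h)}\Big[ \frac{\partial E}{\partial \theta} \Big] + \mathbb{E}_{p(v,s,f,g,h)}\Big[ \frac{\partial E}{\partial \theta} \Big]$$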


where E is the energy function given in Eq. 1. As is
evident from the expression above, the gradient of $\mathcal{L}(Q_v)$ with respect
to the model parameters contains two terms: a
positive phase that depends on the data v and a negative
phase, derived from the partition function of the
joint $p(v,s,f,g,h)$, that does not. We adopt a training
strategy similar to that of Salakhutdinov and Hinton
(2009), in that we combine a variational approximation
of the positive phase of the gradient with a block Gibbs
sampling-based stochastic approximation of the negative
phase. Our Gibbs sampler alternately samples, in
parallel, each set of random variables, sampling from
$P(f \mid v,g,h)$, $P(g \mid v,f,h)$, $P(h \mid v,f,g)$, $p(s \mid v,f,g,h)$,
and finally sampling from $p(v \mid f,g,h,s)$.
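One sweep of the block Gibbs sampler described above could look like the following NumPy sketch. It follows the sampling order given in the text (f, g, h, s, then v), but everything else is an assumption made for illustration: the plus-sign form of the spike logits, a diagonal $\Lambda$ stored as the vector `lam`, and the hypothetical helper names. Details such as the number of chains or how samples feed the gradient are not specified here.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(rng, p):
    """Draw independent binary samples with success probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

def gibbs_sweep(rng, v, f, g, h, W, lam, mu, alpha, c, d, e):
    """One block Gibbs sweep: f | v,g,h -> g | v,f,h -> h | v,f,g -> s -> v."""
    a = np.einsum("d,dijk->ijk", v, W)                   # a_ijk = v^T W_{.ijk}
    t = a * mu + 0.5 * a ** 2 / alpha
    f = bernoulli(rng, sigm(e + np.einsum("ijk,ik,jk->k", t, g, h)))
    g = bernoulli(rng, sigm(c + np.einsum("ijk,jk->ik", t, h) * f))
    h = bernoulli(rng, sigm(d + np.einsum("ijk,ik->jk", t, g) * f))
    q = g[:, None, :] * h[None, :, :] * f                # gate, shape (M, N, K)
    s = (a / alpha + mu) * q + rng.standard_normal(a.shape) / np.sqrt(alpha)
    v_mean = np.einsum("dijk,ijk->d", W, s * q) / lam    # Lambda^{-1} sum W s g h f
    v = v_mean + rng.standard_normal(v.shape) / np.sqrt(lam)
    return v, s, f, g, h

# Small illustrative sizes (assumed, not from the paper).
rng = np.random.default_rng(0)
D, M, N, K = 6, 2, 3, 2
W = 0.1 * rng.normal(size=(D, M, N, K))
lam = np.full(D, 5.0)
mu, alpha = np.ones((M, N, K)), np.ones((M, N, K))
c, d, e = np.zeros((M, K)), np.zeros((N, K)), np.zeros(K)
v = rng.normal(size=D)
f, g, h = np.ones(K), np.ones((M, K)), np.ones((N, K))

for _ in range(5):
    v, s, f, g, h = gibbs_sweep(rng, v, f, g, h, W, lam, mu, alpha, c, d, e)
```

Each conditional factorizes over its own units, which is what allows every set of variables to be sampled in parallel within a sweep.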

3.3. The Challenge of Unsupervised Learning 
to Disentangle 

Above we have briefly outlined our procedure for unsupervised
training of the model. The web of interactions
between the latent random variables, particularly
those between g and h, makes the unsupervised learning
of the model parameters a particularly challenging
problem. It is the difficulty of learning
that motivates our block-wise organization of the interactions
between the g and h variables. The block
structure allows the interactions between g and h to 
remain local, with each g interacting with relatively 
few h and each h interacting with relatively few g. 
This local neighborhood structure allows the inference 
and learning procedures to better manage the com- 
plexities of teasing apart the latent variable interac- 
tions and adapting the model parameters to (approx- 
imately) maximize likelihood. 

By using many of these blocks of local interactions we 
can leverage the known tractable learning properties 
of models such as the RBM. Specifically, if we con- 
sider each block as a kind of super hidden unit gated 
by f, then with no interactions across blocks (apart
from those mediated by the mutual connections to the 
visible units) the model assumes the form of an RBM. 

While our chosen interaction structure allows our 
higher-order model to be able to learn, one conse- 
quence is that the model is only capable of disentan- 
gling relatively local factors that appear within a single 
block. We suggest that one promising avenue to accomplish
more extensive disentangling is to consider
stacking multiple versions of the proposed model and
to consider layer-by-layer disentangling of the factors of
variation present in the data. The idea is to start with
local disentangling and move gradually toward disentangling
non-local and more abstract factors.

4. Related Work 

The model proposed here was strongly influenced by 
previous attempts to disentangle factors of variation in 
data using latent variable models. Some of the earlier
efforts in this direction also used higher-order interactions
of latent variables, specifically bilinear (Tenenbaum
and Freeman, 2000; Grimes and Rao, 2005) and
multilinear (Vasilescu and Terzopoulos, 2005) models.
One critical difference between these previous
attempts to disentangle factors of variation and our 
method is that unlike these previous methods, we are 
attempting to learn to disentangle from entirely unsu- 
pervised information. In this way, one can interpret 
our approach as an attempt to extend the subspace 
feature pooling approach to the problem of disentan- 
gling factors of variation. 

Bilinear models are essentially linear models where the 
higher-level state is factored into the product of two 
variables. Formally, the elements of observation x are
given by $x_k = \sum_i \sum_j W_{ijk}\, y_i z_j$, $\forall k$, where $y_i$ and $z_j$ are
elements of the two factors (y and z) representing the
observation and $W_{ijk}$ is an element of the tensor of
model parameters (Tenenbaum and Freeman, 2000).
The tensor W can be thought of as a generalization
of the typical weight matrix found in most unsupervised
models we have considered above. Tenenbaum
and Freeman (2000) developed an EM-based algorithm
to learn the model parameters and demonstrated, using
images of letters from a set of distinct fonts, that
the model could disentangle the style (font characteristics)
from content (letter identity). Grimes and Rao
(2005) later developed a bilinear sparse coding model
of a similar form to that described above, but added
terms to the objective function to render the
elements of both y and z sparse. They also require
observation of the factors in order to train the model,
and used the model to develop transformation-invariant
features of natural images. Multilinear models are
simply a generalization of the bilinear model where the
number of factors that can be composed together is 2
or more. Vasilescu and Terzopoulos (2005) develop a
multilinear ICA model, which they use to model images
of faces, to disentangle factors of variation such
as illumination, views (orientation of the image plane
relative to the face) and identities of the people.
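The bilinear form $x_k = \sum_i \sum_j W_{ijk}\, y_i z_j$ discussed above is a single tensor contraction. The NumPy sketch below is only an illustration with made-up sizes and a hypothetical function name; it is not code from any of the cited works.

```python
import numpy as np

def bilinear_observation(W, y, z):
    """x_k = sum_i sum_j W[i, j, k] * y[i] * z[j]."""
    return np.einsum("ijk,i,j->k", W, y, z)

rng = np.random.default_rng(0)
I, J, X = 3, 4, 5                     # sizes of the two factors and of the observation
W = rng.normal(size=(I, J, X))
y = rng.normal(size=I)                # e.g. a "style" factor
z = rng.normal(size=J)                # e.g. a "content" factor
x = bilinear_observation(W, y, z)

# With one-hot factors the model simply picks out one basis vector W[i, j, :].
y_onehot = np.eye(I)[1]
z_onehot = np.eye(J)[2]
x_selected = bilinear_observation(W, y_onehot, z_onehot)
```

The observation is linear in y for fixed z and linear in z for fixed y, but not linear in the pair, which is why the two factors cannot be separated by a single linear projection.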



Hinton et al. (2011) also propose to disentangle factors
of variation by learning to extract features associated
with pose parameters, where the changes in pose parameters
(but not the feature values) are known at
training time. The proposed model is also closely related
to recent work (Memisevic and Hinton, 2010),
where higher-order Boltzmann machines are used as
models of spatial transformations in images. While
there are a number of differences between that model
and ours, the most significant difference is our use
of multiplicative interactions between latent variables.
While they included higher-order interactions within
the Boltzmann energy function, these were used exclusively
between observed variables, dramatically simplifying
the inference and learning procedures. Another
major point of departure is that instead of relying on
low-rank approximations to the weight tensor, our approach
employs highly structured and sparse connections
between latent variables (e.g. $g_{ik}$ does not interact
with $h_{jk'}$ for $k' \neq k$), reminiscent of recent
coding approaches (Gregor et al., 2011; Bach et al.,
2011). As discussed above, our use of a sparse connection structure
allows us to isolate groups of interacting latent variables.
Keeping the interactions local in this way is
a key component of our ability to successfully learn
using only unsupervised data.




Figure 3. (top) Samples from our synthetic dataset (before
noise). In each image, a figure "X" can appear at five
different positions, in one of eight basic colors. Objects
in a given image must all be of the same color. (bottom)
Filters learnt by a bilinear ssRBM with M = 3, N = 5,
which successfully show disentangling of color information
(rows) from position (columns).



5. Experiments 

5.1. Toy Experiment 

We showcase the ability of our model to disentan- 
gle factors of variation, by training it on a synthetic 
dataset, a subset of which is shown in Fig. 3 (top).
Each color image, of size 3 × 20, is composed of one
basic object of varying color, which can appear at five
different positions. The constraint is that all objects
in a given image must be of the same color. Additive
Gaussian noise is superimposed on the resulting images
to facilitate mixing of the RBM negative phase.
A bilinear ssRBM with M = 3 and N = 5 should in
theory have the capacity to disentangle the two factors
of variation present in the data, as there are $2^3$ possible
colors and $2^5$ configurations of object placement.
The resulting filters are shown in Fig. 3 (bottom): the
model has successfully learnt a binary encoding of color
along g-units (rows) and positions along h (columns).
Note that this would have been extremely difficult to 
perform without multiplicative interactions of latent 
variables: an RBM with 15 hidden units technically
has the capacity to learn similar filters; however, it
would be incapable of enforcing mutual exclusivity between
hidden units of different color. The bilinear ssRBM
model, on the other hand, generates near-perfect
samples (not shown), while factoring the representation
for use in deeper layers.
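To make the multiplicative structure of this toy data concrete, the sketch below generates one image as the outer product of a binary color code and a binary position code. Several details are assumptions made purely for illustration and are not taken from the experiment: 3 × 20 is read as three color channels by 20 pixels, each of the five positions is drawn as a solid run of four pixels rather than an "X", and the noise level is arbitrary.

```python
import numpy as np

def make_toy_image(color_bits, position_bits, width=4):
    """Return a (3, 20) array: color channels x pixels.

    color_bits: 3 binary values (2**3 = 8 colors).
    position_bits: 5 binary values (2**5 = 32 placement configurations).
    """
    color = np.asarray(color_bits, dtype=float)
    positions = np.asarray(position_bits, dtype=float)
    mask = np.repeat(positions, width)      # which of the 20 pixels carry an object
    return np.outer(color, mask)            # every object shares the same color

rng = np.random.default_rng(0)
color_bits = rng.integers(0, 2, size=3)
position_bits = rng.integers(0, 2, size=5)
image = make_toy_image(color_bits, position_bits)
noisy = image + 0.1 * rng.standard_normal(image.shape)   # additive Gaussian noise

example = make_toy_image([1, 0, 1], [0, 1, 0, 0, 0])
```

Because the clean image is a rank-one product of the two codes, neither factor can be read off with a purely additive model, which is the situation the bilinear g and h units are meant to handle.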

5.2. Toronto Face Dataset 



We evaluate our model on the recently introduced 
Toronto Face Dataset (TFD) (Susskind et al., 2010),
which contains a large number of black & white 48 × 48
preprocessed facial images. These span a wide range 
of identities and emotions and as such, the dataset 






is well suited to study the problem of disentangling: 
models which can successfully separate identity from 
emotion should perform well at the supervised learn- 
ing task, which involves classifying images into one 
of seven categories: {anger, disgust, fear, happy, sad, 
surprise, neutral}. The dataset is divided into two 
parts: a large unlabeled set (meant for unsupervised 
feature learning) and a smaller labeled set. Note that 
emotions appear much more prominently in the latter, 
since these are acted out and thus prone to exaggera- 
tion. In contrast, most of the unlabeled set contains 
natural expressions over a wider range of individuals. 

In the course of this work, we have made several key 
refinements to the original spike-and-slab formulation. 
Notably, since the slab variables $\{s_{ijk}; \forall j\}$ can be 
interpreted as coordinates in the subspace of the spike 
variable $g_{ik}$ (which spans the set of filters $\{W_{\cdot ijk}; \forall j\}$), 
it is natural for these filters to be unit-norm. Each 
maximum likelihood gradient update is thus followed 
by a projection of the filters onto the unit-norm ball. 
Similarly, there exists an over-parametrization in the 
direction of $W_{\cdot ijk}$ and the sign of $\mu_{ijk}$, the parameter 
controlling the mean of $s_{ijk}$. We thus constrain 
$\mu_{ijk}$ to be positive, in our case greater than 1. 
Similar constraints are applied on B and $\alpha$ to ensure 
that the variances on the visible and slab variables 
remain bounded. While previous work (Courville et al., 
2011a) used the expected value of the spike variables 
as the input to classifiers, or higher layers in deep 
networks, we found that the above re-parametrization 
consistently led to better results when using the product 
of expectations of h and s. For pooled models, we 
simply take the product of each binary spike with the 
norm of its associated slab vector. 
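
The constraints described above can be sketched as a projection step applied after each gradient update. This is an illustrative sketch only: the function name `project_parameters`, the array layout (filters stored as `W[:, i, j, k]`), and the small constant `eps` are assumptions, and the filters are rescaled to exactly unit norm, whereas a strict projection onto the unit ball would only shrink filters whose norm exceeds 1.

```python
import numpy as np

def project_parameters(W, mu, mu_min=1.0, eps=1e-12):
    """Apply the norm and positivity constraints after a gradient step.

    W  : array of shape (D, M, N, K); W[:, i, j, k] is one filter
         (assumed layout, D is the number of visible units).
    mu : array of shape (M, N, K), the slab mean parameters.
    """
    # Rescale every filter to unit L2 norm (assumption: exact
    # normalization rather than shrinking only norms above 1).
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    W = W / np.maximum(norms, eps)
    # Keep mu positive; the text uses a lower bound of 1.
    mu = np.maximum(mu, mu_min)
    return W, mu

rng = np.random.default_rng(0)
W, mu = project_parameters(rng.standard_normal((48 * 48, 5, 5, 100)),
                           rng.standard_normal((5, 5, 100)))
```

Fixing the filter norm and the sign of the slab mean removes the scale and sign ambiguity between a filter and its slab variable, so that the magnitude of the slab carries the information used downstream.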

Disentangling Emotion from Identity. We begin 
with a qualitative evaluation of our model, by 
visualizing the learned filters (inner-most dimension of 
the matrix W) and pooling structures. We trained a 
model with K = 100 and M = N = 5 (that is to 
say 100 blocks of 5 × 5 interacting g and h units) on 
a weighted combination of the labeled and unlabeled 
training sets. Doing so (as opposed to training on the 
unlabeled set only) allows for greater interpretability 
of the results, as emotion is a more prominent factor 
of variation in the labeled set. The results, shown in 
Figure 4, clearly show global cohesion within blocks 
pooled by $f_k$, with row and column structure 
correlating with variances in appearance/identity and 
emotions. 

Disentangling via Unsupervised Feature Learn- 
ing. We now evaluate the representation learnt by 
our disentangling RBM, by measuring its usefulness for 



Figure 4. Example blocks obtained with K = 100, M = 
N = 5. The filters (inner-most dimension of tensor W) in 
each block exhibit global cohesion, specializing themselves 
to a subset of identities and emotions: {happiness, fear, 
neutral} in (left) and {happiness, anger} in (right). In both 
cases, g-units (which pool over columns) encode emotions, 
while h-units (which pool over rows) are more closely tied 
to identity. 



the task of emotion recognition. Our main objective 
here is to evaluate the usefulness of disentangling, over 
traditional approaches of pooling, as well as the use of 
larger, unpooled models. We thus consider ssRBMs 
with 3000 and 5000 features, with either (i) no pooling 
(i.e. K = 5000 spikes with N = 1 slabs per spike), 
(ii) pooling along a single dimension (i.e. K = 1000 
spike variables, pooling N = 5 slabs) or (iii) disentangled 
through our higher-order ssRBM (i.e. K = 200, 
with g and h units arranged in an M × N grid, with 
M = N = 5). 
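
The three configurations above are matched in their number of filters, which a short calculation makes explicit. The helper `num_filters` is a hypothetical name introduced for this example; it only counts one filter per (i, j, k) triple.

```python
def num_filters(K, M=1, N=1):
    """Number of filters in a model with K blocks of M x N units."""
    return K * M * N

configs = {
    "no pooling (K=5000, N=1)": num_filters(5000),
    "single-dimension pooling (K=1000, N=5)": num_filters(1000, N=5),
    "higher-order (K=200, M=N=5)": num_filters(200, M=5, N=5),
}
for name, count in configs.items():
    print(f"{name}: {count} filters")
```

By the same count, the model with K = 100 and M = N = 5 used for the qualitative evaluation contains 100 × 5 × 5 = 2,500 filters.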

We followed the standard TFD training protocol of 
performing unsupervised training on the unlabeled set, 
and then using the learnt representation as input to a 
linear SVM, trained and cross-validated on the labeled 
set. Table 1 shows the test accuracy obtained by 
various spike-and-slab models, averaged over the 5 folds. 

We report two sets of numbers for models with pooling 
or disentangling: one where we use the "factored 
representation", which is the element-wise product 
of spike variables with the norm of their associated 
slab vector, and the "unfactored representation": the 
higher-dimensional representation formed by considering 
all slab variables, each multiplied by their associated 
spikes. 
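
For a model that pools along a single dimension, the two representations can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `pooled_features` is hypothetical, and the inputs are taken to be posterior expectations of the spikes (shape `(batch, K)`) and of the slabs (shape `(batch, K, N)`) that have already been computed by inference.

```python
import numpy as np

def pooled_features(spike_mean, slab_mean):
    """Return the (factored, unfactored) features for a pooled model.

    spike_mean : (batch, K) expected spike values.
    slab_mean  : (batch, K, N) expected slab values, N slabs per spike.
    """
    # Factored: each spike times the norm of its slab vector -> (batch, K).
    factored = spike_mean * np.linalg.norm(slab_mean, axis=2)
    # Unfactored: every slab times its associated spike -> (batch, K * N).
    unfactored = (spike_mean[:, :, None] * slab_mean).reshape(len(spike_mean), -1)
    return factored, unfactored

spikes = np.array([[1.0, 0.5]])
slabs = np.array([[[3.0, 4.0], [0.0, 2.0]]])
factored, unfactored = pooled_features(spikes, slabs)
```

With N = 5 the factored representation is five times smaller than the unfactored one, so a classifier that does at least as well on it is working from a more compact description of the input.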

We can see that the higher-order ssRBM achieves the 
best result: 77.4%, using the factored representation. 
The fact that our model outperforms the "unfactored" 
one supports our disentangling hypothesis: 
our model has successfully learnt a lower-dimensional 
(factored) representation of the data, useful for 
classification. For reference, a linear SVM classifier on 
the pixels achieves 71.5% (Susskind et al., 2010), an 
MLP trained with supervised backprop 72.72%, while 
a deep mPoT model (Ranzato et al., 2011), which 
exploits local receptive fields, achieves 82.4%. 

Table 1. Classification accuracy for Toronto Face Dataset. We compare our higher-order ssRBM for various block sizes K 
and pooling regions M × N. The comparison is against first-order ssRBMs, which thus pool in a single dimension of size 
N. First four models contain approximately 3,000 filters, while bottom four contain 5,000. In both cases, we compare 
the effect of using the factored representation, to the unfactored representation. 



6. Conclusion 

We have presented a higher-order extension of the 
spike-and-slab restricted Boltzmann machine that fac- 
tors the standard binary spike variable into three in- 
teracting factors. From a generative perspective, these 
interactions act to entangle the factors represented by 
the latent binary variables. Inference is interpreted as 
a process of disentangling the factors of variation in the 
data. As previously mentioned, we believe an impor- 
tant direction of future research to be the exploration 
of methods to gradually disentangle the factors of 
variation by stacking multiple instantiations of the 
proposed model into a deep architecture. 